The existence of forests is essential for life on Earth. Covering around 31 percent of the world’s total land area, forests provide a retreat and home to over 80 percent of land animals and countless plant species, some of them still undiscovered. One can say that forests are the backbone of entire ecosystems. A significant part of the oxygen we breathe is provided by trees, which also absorb about 25 percent of greenhouse gases. We also depend on forests economically: about 1.6 billion people around the world earn their livelihoods from forests. Furthermore, forests provide 40 percent of today’s global renewable energy supply, as much as solar, hydroelectric and wind power combined. Despite these benefits, forests across the world face several challenges, ranging from wildfires and human-driven deforestation to poor management and poor conservation in general. The loss of whole forests would have severe consequences for humanity and life on Earth.
With this project we seek to answer important questions that address these challenges. We want to identify the causes of forest destruction, highlight the importance of forests to our environment, and predict trends in reforestation and deforestation. Moreover, we hope to show how reforestation can help tackle climate change, in particular how an increase in forest area increases the buffer of sustainability. For the statistics above, see our reference (accessed 7 May 2021).
The first data set is about the forest cover of the continents: forest land per continent.
First we preprocess the data set: we rename the columns we need later to make them more readable.
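The renaming step can be sketched in base R on a single in-memory row shaped like the data printed below (the report reads the actual FAO CSV with readr; the file path is omitted here):

```r
# Illustrative single row with the original FAO column names.
df <- data.frame(
  Domain = "Forest Land", Area = "Africa", Element = "Area",
  Item = "Forest land", Year = 1990, Value = 742801
)
# Rename 'Area' -> 'Continent' and 'Value' -> 'ForestSize' for readability.
names(df)[names(df) == "Area"]  <- "Continent"
names(df)[names(df) == "Value"] <- "ForestSize"
```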
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Domain = col_character(),
## Area = col_character(),
## Element = col_character(),
## Item = col_character(),
## Year = col_double(),
## Value = col_double()
## )
## # A tibble: 155 x 6
## Domain Continent Element Item Year ForestSize
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Forest Land Africa Area Forest land 1990 742801.
## 2 Forest Land Africa Area Forest land 1991 739526.
## 3 Forest Land Africa Area Forest land 1992 736251.
## 4 Forest Land Africa Area Forest land 1993 732976.
## 5 Forest Land Africa Area Forest land 1994 729700.
## 6 Forest Land Africa Area Forest land 1995 726425.
## 7 Forest Land Africa Area Forest land 1996 723150.
## 8 Forest Land Africa Area Forest land 1997 719875.
## 9 Forest Land Africa Area Forest land 1998 716599.
## 10 Forest Land Africa Area Forest land 1999 713324.
## # … with 145 more rows
## spec_tbl_df[,6] [155 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Domain : chr [1:155] "Forest Land" "Forest Land" "Forest Land" "Forest Land" ...
## $ Continent : chr [1:155] "Africa" "Africa" "Africa" "Africa" ...
## $ Element : chr [1:155] "Area" "Area" "Area" "Area" ...
## $ Item : chr [1:155] "Forest land" "Forest land" "Forest land" "Forest land" ...
## $ Year : num [1:155] 1990 1991 1992 1993 1994 ...
## $ ForestSize: num [1:155] 742801 739526 736251 732976 729700 ...
## - attr(*, "spec")=
## .. cols(
## .. Domain = col_character(),
## .. Area = col_character(),
## .. Element = col_character(),
## .. Item = col_character(),
## .. Year = col_double(),
## .. Value = col_double()
## .. )
## [1] 155 6
Here we have pre-processed our data set:
## # A tibble: 155 x 4
## # Groups: Continent, Year [155]
## Continent Year ForestSize AvgforestSize
## <chr> <dbl> <dbl> <dbl>
## 1 Africa 1990 742801. 742801.
## 2 Americas 1990 1728946. 1728946.
## 3 Asia 1990 569978. 569978.
## 4 Europe 1990 1009734. 1009734.
## 5 Oceania 1990 184974. 184974.
## 6 Africa 1991 739526. 739526.
## 7 Americas 1991 1723550. 1723550.
## 8 Asia 1991 570130. 570130.
## 9 Europe 1991 1010579. 1010579.
## 10 Oceania 1991 184810. 184810.
## # … with 145 more rows
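The per-continent averages shown above come from a grouped summary; a base-R sketch on illustrative values (the report itself uses dplyr's `group_by`/`summarise`):

```r
# Three illustrative rows of the continent-level forest data.
forest <- data.frame(
  Continent  = c("Africa", "Africa", "Americas"),
  Year       = c(1990, 1991, 1990),
  ForestSize = c(742801, 739526, 1728946)
)
# Mean forest size per (Continent, Year) group.
avg <- aggregate(ForestSize ~ Continent + Year, data = forest, FUN = mean)
names(avg)[names(avg) == "ForestSize"] <- "AvgforestSize"
```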
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
### Findings
+ Forest land has decreased globally
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Domain = col_character(),
## Area = col_character(),
## Element = col_character(),
## Item = col_character(),
## Year = col_double(),
## Unit = col_character(),
## Value = col_double()
## )
## Rows: 7,106
## Columns: 7
## $ Domain <chr> "Forest Land", "Forest Land", "Forest Land", "Forest Land",…
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",…
## $ Element <chr> "Area", "Area", "Area", "Area", "Area", "Area", "Area", "Ar…
## $ Item <chr> "Forest land", "Forest land", "Forest land", "Forest land",…
## $ Year <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
## $ Unit <chr> "1000 ha", "1000 ha", "1000 ha", "1000 ha", "1000 ha", "100…
## $ ForestArea <dbl> 1208.44, 1208.44, 1208.44, 1208.44, 1208.44, 1208.44, 1208.…
## Selecting by avg_forest_size
### Findings
+ The Russian Federation has had the largest forest cover in the last 30 years
## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
## # A tibble: 7,106 x 3
## # Groups: Country [242]
## Country Year forest_perYear
## <chr> <dbl> <dbl>
## 1 Afghanistan 1990 0.607
## 2 Afghanistan 1991 0.607
## 3 Afghanistan 1992 0.607
## 4 Afghanistan 1993 0.606
## 5 Afghanistan 1994 0.606
## 6 Afghanistan 1995 0.606
## 7 Afghanistan 1996 0.605
## 8 Afghanistan 1997 0.605
## 9 Afghanistan 1998 0.605
## 10 Afghanistan 1999 0.605
## # … with 7,096 more rows
## Selecting by avg_forest_size
### Findings
+ Curaçao has shown the least forest cover in the last 30 years
The air pollution data contains air quality values in micrograms per cubic metre from 1990 to 2019, although data for a few years is missing. Air quality reflects changes in the amount of pollution in the air.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## COU = col_character(),
## Country = col_character(),
## SMALL_SUBNATIONAL_REGION = col_character(),
## `Small subnational region` = col_character(),
## LARGE_SUBNATIONAL_REGION = col_character(),
## `Large subnational region` = col_character(),
## VAR = col_character(),
## Variable = col_character(),
## YEA = col_double(),
## Year = col_double(),
## `Unit Code` = col_character(),
## Unit = col_character(),
## `PowerCode Code` = col_double(),
## PowerCode = col_character(),
## `Reference Period Code` = col_logical(),
## `Reference Period` = col_logical(),
## Value = col_double(),
## `Flag Codes` = col_logical(),
## Flags = col_logical()
## )
## Rows: 3,444
## Columns: 19
## $ COU <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "…
## $ Country <chr> "Australia", "Australia", "Australia", "Aus…
## $ SMALL_SUBNATIONAL_REGION <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL"…
## $ `Small subnational region` <chr> "Total", "Total", "Total", "Total", "Total"…
## $ LARGE_SUBNATIONAL_REGION <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL"…
## $ `Large subnational region` <chr> "Total", "Total", "Total", "Total", "Total"…
## $ VAR <chr> "PWM_EX", "PWM_EX", "PWM_EX", "PWM_EX", "PW…
## $ Variable <chr> "Mean population exposure to PM2.5", "Mean …
## $ YEA <dbl> 1990, 1995, 2000, 2005, 2010, 2011, 2012, 2…
## $ Year <dbl> 1990, 1995, 2000, 2005, 2010, 2011, 2012, 2…
## $ `Unit Code` <chr> "MICRO_M3", "MICRO_M3", "MICRO_M3", "MICRO_…
## $ Unit <chr> "Micrograms per cubic metre", "Micrograms p…
## $ `PowerCode Code` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PowerCode <chr> "Units", "Units", "Units", "Units", "Units"…
## $ `Reference Period Code` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `Reference Period` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ AirPollution <dbl> 7.60250, 7.49591, 7.36613, 6.90976, 6.78718…
## $ `Flag Codes` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Flags <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## Rows: 3,444
## Columns: 3
## Groups: Country, Year [3,444]
## $ Country <chr> "Australia", "Australia", "Australia", "Australia", "Aust…
## $ Year <dbl> 1990, 1995, 2000, 2005, 2010, 2011, 2012, 2013, 2014, 201…
## $ AirPollution <dbl> 7.60250, 7.49591, 7.36613, 6.90976, 6.78718, 6.71166, 7.0…
## # A tibble: 14 x 3
## Year TotalForestArea TotalAirPollution
## <dbl> <dbl> <dbl>
## 1 1990 2668100. 4788.
## 2 1995 2690005. 5146.
## 3 2000 2646222. 5428.
## 4 2005 2606657. 5174.
## 5 2010 2570631. 5286.
## 6 2011 2564881. 5410.
## 7 2012 2586020. 5543.
## 8 2013 2580096. 5380.
## 9 2014 2574171. 5257.
## 10 2015 2568247. 5592.
## 11 2016 2562096. 5375.
## 12 2017 2555945. 5256.
## 13 2018 2549794. 5256.
## 14 2019 2543643. 5231.
## `geom_smooth()` using formula 'y ~ x'
Findings: No linear relationship between Forest Area and Air Pollution
## [1] "Kendall = -0.032967032967033"
Findings: The negative correlation coefficient shows that the variables move in opposite directions.
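Kendall's tau compares the rank order of the two yearly series; on the first five year-rows of the table above it can be computed like this:

```r
# Kendall's tau counts concordant vs. discordant pairs of observations,
# so it looks only at rank order, not magnitudes.
forest_area   <- c(2668100, 2690005, 2646222, 2606657, 2570631)
air_pollution <- c(4788, 5146, 5428, 5174, 5286)
tau <- cor(forest_area, air_pollution, method = "kendall")
tau  # -0.4 for these five years: the series tend to move oppositely
```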
## parsnip model object
##
## Fit time: 3ms
##
## Call:
## stats::lm(formula = ForestArea ~ AirPollution, data = data)
##
## Coefficients:
## (Intercept) AirPollution
## 15732 -116
Findings:
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15732. 1541. 10.2 4.73e-24
## 2 AirPollution -116. 48.7 -2.38 1.74e- 2
Findings: The low p-value (0.0174) for the AirPollution coefficient means the slope differs significantly from zero, i.e. Forest Area and Air Pollution are not independent of each other.
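The fitted model corresponds to a plain `stats::lm` call; a self-contained sketch on synthetic data with a similar negative slope (the data-generating numbers here are illustrative, not the report's):

```r
set.seed(1)
air    <- runif(50, 5, 40)                        # synthetic pollution values
forest <- 15732 - 116 * air + rnorm(50, sd = 500) # noisy linear response
fit <- lm(forest ~ air)
coef(fit)                                # intercept near 15732, slope near -116
summary(fit)$coefficients[, "Pr(>|t|)"]  # p-values, as in the tibble above
```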
## `geom_smooth()` using formula 'y ~ x'
Finding:
Because the variables are non-linearly associated, we use the residual plot to assess the appropriateness of our linear regression.
In the plot the residuals lie far from zero, which shows that the model is not a good fit.
Next we compute the Predictive Power Score (PPS).
## Warning in score(df, x = param_grid[["x"]][i], y = param_grid[["y"]][i], : There are on average only 2.8 observations in each test-set for the ForestArea-AirPollution relationship.
## Model performance will be highly instable. Fewer cv_folds are advised.
## Warning in score(df, x = param_grid[["x"]][i], y = param_grid[["y"]][i], : There are on average only 2.8 observations in each test-set for the AirPollution-ForestArea relationship.
## Model performance will be highly instable. Fewer cv_folds are advised.
## Warning in pal_name(palette, type): Unknown palette Green
Findings:
A Predictive Power Score of 0 indicates no predictive relationship between Forest Area and Air Pollution.
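The PPS computation can be sketched with the ppsr package, whose `score()` function appears in the warnings above (the data frame below is illustrative; the report uses its joined forest/pollution table):

```r
# Sketch only: ppsr fits a decision tree in each direction and compares it
# against a naive baseline; a score of 0 means no predictive power.
library(ppsr)
df <- data.frame(ForestArea   = runif(14, 2.5e6, 2.7e6),
                 AirPollution = runif(14, 4700, 5600))
pps_xy <- score(df, x = "ForestArea", y = "AirPollution")$pps
pps_yx <- score(df, x = "AirPollution", y = "ForestArea")$pps
```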
## Selecting by avgAirPollution
Findings:
India has the highest air pollution in the last 30 years
## Selecting by avgAirPollution
Finding: Nauru has the least air pollution in the last 30 years.
Findings:
* Air pollution drastically increased after the year 2005.
* It was highest in 2012.
* It decreased in 2005 by ~10 micrograms per cubic metre.
* There was a sudden decrease around 2017.
* It increased slightly again in 2019.
Findings:
* Air pollution varies in the range of 5 to ~7 micrograms per cubic metre.
* It was lowest in 2014.
This part deals with questions related to deforestation and the environment:
Here forest cover is visualized country-wise; hovering over a country shows its values.
Countries are shown in different colors depending on the percentage of forest lost in the last three decades; the countries with the highest percentage of forest lost are the main drivers of deforestation.
Note: for some countries the deforestation data was zero or not available; for those, deforestation was estimated as the difference in forest area between 1990 and 2020.
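The fallback used for those countries can be sketched as follows (country names and values are hypothetical, purely for illustration):

```r
# Forest area (1000 ha) in 1990 and 2020 for two hypothetical countries.
area_1990 <- c(CountryA = 1000, CountryB = 800)
area_2020 <- c(CountryA = 850,  CountryB = 800)
forest_lost <- area_1990 - area_2020          # absolute loss since 1990
pct_lost    <- 100 * forest_lost / area_1990  # share of 1990 cover lost
pct_lost    # CountryA lost 15%, CountryB nothing
```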
The bar graph below shows trends in deforestation over the last three decades across the five continents.
Note: here "Americas" refers to North and South America, while "Oceania" refers to Australasia, Melanesia, Micronesia and Polynesia.
The world map below shows, country by country, the amount of forest area required to offset current CO2 emissions. As we only had CO2 emission data for 1990-2018, we used an ARIMA time-series model to predict future emissions; for this purpose a function was created that predicts future CO2 emissions for every country.
Taking into account the amount of CO2 absorbed by forests (one acre of new forest can sequester about 2.5 metric tons of carbon annually), the predicted CO2 value is used to find the amount of additional forest required.
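The per-country prediction plus conversion can be sketched as below. This assumes the forecast package's `auto.arima` as the ARIMA implementation (the report does not name its package); the 2.5 tons/acre/year figure is taken from the text above.

```r
library(forecast)

# Forecast a country's CO2 emissions h years beyond its 1990-2018 series.
predict_emission <- function(values, h = 2) {
  fit <- auto.arima(ts(values, start = 1990))
  as.numeric(forecast(fit, h = h)$mean)
}

# Acres of new forest needed to sequester the predicted tons of carbon,
# at ~2.5 tons of carbon per acre of new forest per year.
acres_needed <- function(predicted_tons) predicted_tons / 2.5
```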
In this part we dig deeper into the topic of reforestation by analyzing which countries are the main drivers, whether there is a correlation between reforestation and deforestation, and showing the trends across continents.
As these results mainly show countries with a huge surface area, we put the increase in reforestation from 1990-2020 in relation to the forest area in 1990.
(source of idea https://plotly.com/r/choropleth-maps/#introduction-main-parameters-for-choropleth-outline-maps)
First of all, to get a first impression, we show the relation between total reforestation and deforestation over the last 30 years.
As there are many outliers with either very high deforestation/reforestation figures, we zoom in by changing the scale.
The figure already shows a skewed, non-linear distribution of our data.
With the Shapiro-Wilk test we check whether our data is normally distributed.
##
## Shapiro-Wilk normality test
##
## data: corref$totalref
## W = 0.25387, p-value < 0.00000000000000022
##
## Shapiro-Wilk normality test
##
## data: corref$totaldef
## W = 0.16293, p-value < 0.00000000000000022
The p-values are below 0.05 for both reforestation and deforestation, so the data deviate significantly from a normal distribution, a result already suggested by the graph.
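The behaviour of the test can be illustrated on synthetic data: for a heavily skewed sample the p-value drops below 0.05 and normality is rejected, just as for totalref and totaldef above.

```r
set.seed(42)
skewed <- rexp(200)^3    # strongly right-skewed sample
res <- shapiro.test(skewed)
res$p.value < 0.05       # TRUE: reject normality

normalish <- rnorm(200)
shapiro.test(normalish)$p.value  # usually well above 0.05 for normal data
```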
As the data is therefore not normally distributed, we choose the Spearman (rank-based) method to calculate the correlation.
## # A tibble: 2 x 3
## term totalref totaldef
## <chr> <dbl> <dbl>
## 1 totalref NA 0.206
## 2 totaldef 0.206 NA
##
## Pearson's product-moment correlation
##
## data: corref$totalref and corref$totaldef
## t = 3.2181, df = 234, p-value = 0.001473
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08027866 0.32502440
## sample estimates:
## cor
## 0.2058687
## # A tibble: 2 x 3
## term totalref totaldef
## <chr> <dbl> <dbl>
## 1 totalref NA 0.451
## 2 totaldef 0.451 NA
##
## Spearman's rank correlation rho
##
## data: corref$totalref and corref$totaldef
## S = 1201838, p-value = 0.0000000000003009
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4513834
With a rho of 0.451 there is a moderate correlation, which suggests that a relationship between deforestation and reforestation exists.
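Why Spearman suits this data better than Pearson can be seen on a tiny example: rank-based correlation ignores the kind of extreme outliers that dominate our reforestation figures.

```r
x <- c(1, 2, 3, 4, 100)   # one extreme outlier, as in our country data
y <- c(2, 4, 5, 8, 9)     # strictly increasing with x
cor(x, y, method = "spearman")  # 1: perfect monotone association
cor(x, y, method = "pearson")   # ~0.68: dragged down by the outlier
```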
The next part of our project is about forest destruction and tries to answer the following questions:
FAO. 2020. Global Forest Resources Assessment 2020 - data set 2
## Rows: 4,248
## Columns: 14
## $ regions <chr> "North and Central America", "North and Central Ameri…
## $ iso3 <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW…
## $ name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba",…
## $ year <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,…
## $ boreal <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ temperate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ tropical <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
## $ subtropical <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `5a_insect` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `5a_diseases` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `5a_weather` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `5a_other` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `5b_fire_land` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `5b_fire_forest` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
First we remove everything that we won’t need and make column names (of columns that we will later need) more readable:
We also replace missing values with 0, because there is no safe way to impute them. However, this implies that our results for this part of the project are probably an underestimate.
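The replacement step itself is one line in base R (tidyr's `replace_na` would do the same); a minimal sketch on two illustrative disturbance columns:

```r
# Two disturbance columns with missing entries, as in the FRA data.
df <- data.frame(Insects = c(NA, 2), Fire = c(5, NA))
df[is.na(df)] <- 0   # every NA becomes 0 (likely an underestimate)
```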
## # A tibble: 4,248 x 13
## name continent year Insects Diseases Weather Others Fire boreal temperate
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aruba Americas 2000 0 0 0 0 0 0 0
## 2 Aruba Americas 2001 0 0 0 0 0 0 0
## 3 Aruba Americas 2002 0 0 0 0 0 0 0
## 4 Aruba Americas 2003 0 0 0 0 0 0 0
## 5 Aruba Americas 2004 0 0 0 0 0 0 0
## 6 Aruba Americas 2005 0 0 0 0 0 0 0
## 7 Aruba Americas 2006 0 0 0 0 0 0 0
## 8 Aruba Americas 2007 0 0 0 0 0 0 0
## 9 Aruba Americas 2008 0 0 0 0 0 0 0
## 10 Aruba Americas 2009 0 0 0 0 0 0 0
## # … with 4,238 more rows, and 3 more variables: tropical <dbl>,
## # subtropical <dbl>, 5b_fire_land <dbl>
FAO. 2021. FAOSTAT Temperature Change Dataset
## Rows: 537,370
## Columns: 10
## $ `Area Code` <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Area <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
## $ `Months Code` <dbl> 7001, 7001, 7001, 7001, 7001, 7001, 7001, 7001, 7001, 7…
## $ Months <chr> "January", "January", "January", "January", "January", …
## $ `Element Code` <dbl> 7271, 7271, 7271, 7271, 7271, 7271, 7271, 7271, 7271, 7…
## $ Element <chr> "Temperature change", "Temperature change", "Temperatur…
## $ `Year Code` <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1…
## $ Year <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1…
## $ Value <dbl> 0.746, 0.009, 2.695, -5.277, 1.827, 3.629, -1.436, 0.38…
## $ Flag <chr> "Fc", "Fc", "Fc", "Fc", "Fc", "Fc", "Fc", "Fc", "Fc", "…
Again we remove everything that we won’t need:
## # A tibble: 151,621 x 5
## Area Months Element Year Value
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan January Temperature change 2000 1.60
## 2 Afghanistan January Temperature change 2001 -0.569
## 3 Afghanistan January Temperature change 2002 1.64
## 4 Afghanistan January Temperature change 2003 2.54
## 5 Afghanistan January Temperature change 2004 2.74
## 6 Afghanistan January Temperature change 2005 0.172
## 7 Afghanistan January Temperature change 2006 -1.49
## 8 Afghanistan January Temperature change 2007 0.582
## 9 Afghanistan January Temperature change 2008 -5.40
## 10 Afghanistan January Temperature change 2009 1.63
## # … with 151,611 more rows
After preprocessing we can start answering the questions.
For the first two questions we decided to look at a global scale and for the country Germany.
For the third question we decided to look at a global scale and for the continent Europe.
Note: In our interactive shiny website you can pick the country / continent of your interest.
Our first visualization doesn’t suggest any linear correlation, but it can be improved to make the interpretation clearer.
Next we compare global yearly temperature changes to global forest area destructed by wildfires.
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
There is no linear correlation visible for either land or forest fires and rising temperatures.
Now we compare global yearly temperature changes to the global count of wildfires.
Again there is no linear correlation visible for either land or forest fires and rising temperatures. Maybe we can find a correlation if we go into more detail and show not only global values but values for each country and each year:
We need a way to quantify these results. Since the data is clearly not linear, we use the Predictive Power Score [..] as a measure of correlation.
## Warning in score(df, x = param_grid[["x"]][i], y = param_grid[["y"]][i], : There are on average only 3.6 observations in each test-set for the Wildfires-Temperature relationship.
## Model performance will be highly instable. Fewer cv_folds are advised.
## Warning in score(df, x = param_grid[["x"]][i], y = param_grid[["y"]][i], : There are on average only 3.6 observations in each test-set for the Temperature-Wildfires relationship.
## Model performance will be highly instable. Fewer cv_folds are advised.
## Warning in score(df, x = param_grid[["x"]][i], y = param_grid[["y"]][i], : There are on average only 3.6 observations in each test-set for the Wildfires-Temperature relationship.
## Model performance will be highly instable. Fewer cv_folds are advised.
## Warning in score(df, x = param_grid[["x"]][i], y = param_grid[["y"]][i], : There are on average only 3.6 observations in each test-set for the Temperature-Wildfires relationship.
## Model performance will be highly instable. Fewer cv_folds are advised.
First we look at the available data and see which columns might help us to make a prediction.
## # A tibble: 151,621 x 5
## Area Months Element Year Value
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan January Temperature change 2000 1.60
## 2 Afghanistan January Temperature change 2001 -0.569
## 3 Afghanistan January Temperature change 2002 1.64
## 4 Afghanistan January Temperature change 2003 2.54
## 5 Afghanistan January Temperature change 2004 2.74
## 6 Afghanistan January Temperature change 2005 0.172
## 7 Afghanistan January Temperature change 2006 -1.49
## 8 Afghanistan January Temperature change 2007 0.582
## 9 Afghanistan January Temperature change 2008 -5.40
## 10 Afghanistan January Temperature change 2009 1.63
## # … with 151,611 more rows
## # A tibble: 4,248 x 13
## name continent year Insects Diseases Weather Others Fire boreal temperate
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aruba Americas 2000 0 0 0 0 0 0 0
## 2 Aruba Americas 2001 0 0 0 0 0 0 0
## 3 Aruba Americas 2002 0 0 0 0 0 0 0
## 4 Aruba Americas 2003 0 0 0 0 0 0 0
## 5 Aruba Americas 2004 0 0 0 0 0 0 0
## 6 Aruba Americas 2005 0 0 0 0 0 0 0
## 7 Aruba Americas 2006 0 0 0 0 0 0 0
## 8 Aruba Americas 2007 0 0 0 0 0 0 0
## 9 Aruba Americas 2008 0 0 0 0 0 0 0
## 10 Aruba Americas 2009 0 0 0 0 0 0 0
## # … with 4,238 more rows, and 3 more variables: tropical <dbl>,
## # subtropical <dbl>, 5b_fire_land <dbl>
## # A tibble: 3,809 x 14
## name continent year Insects Diseases Weather Others wildfire boreal
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 Aruba Americas 2003 0 0 0 0 No 0
## 2 Aruba Americas 2004 0 0 0 0 No 0
## 3 Aruba Americas 2005 0 0 0 0 No 0
## 4 Aruba Americas 2006 0 0 0 0 No 0
## 5 Aruba Americas 2007 0 0 0 0 No 0
## 6 Aruba Americas 2010 0 0 0 0 No 0
## 7 Aruba Americas 2011 0 0 0 0 No 0
## 8 Aruba Americas 2012 0 0 0 0 No 0
## 9 Aruba Americas 2013 0 0 0 0 No 0
## 10 Aruba Americas 2014 0 0 0 0 No 0
## # … with 3,799 more rows, and 5 more variables: temperate <dbl>,
## # tropical <dbl>, subtropical <dbl>, 5b_fire_land <dbl>, temp_increase <dbl>
We have roughly 3,800 samples for the prediction.
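The classification itself can be sketched with a plain rpart tree on synthetic data (the report uses tidymodels with 5-fold cross-validation; the decision rule below is invented purely for illustration):

```r
library(rpart)
set.seed(7)
n <- 300
d <- data.frame(
  temp_increase = rnorm(n),          # yearly temperature change
  tropical      = runif(n, 0, 100)   # share of tropical forest (%)
)
# Hypothetical signal: warmer, more tropical observations burn more often.
d$wildfire <- factor(ifelse(d$temp_increase + d$tropical / 50 +
                              rnorm(n, sd = 0.5) > 1, "Yes", "No"))
fit <- rpart(wildfire ~ temp_increase + tropical, data = d, method = "class")
acc <- mean(predict(fit, d, type = "class") == d$wildfire)  # training accuracy
```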
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.595 5 0.00994 Preprocessor1_Model1
## 2 roc_auc binary 0.575 5 0.0130 Preprocessor1_Model1
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.586 5 0.0141 Preprocessor1_Model1
## 2 roc_auc binary 0.533 5 0.00705 Preprocessor1_Model1
## x Fold2: preprocessor 1/1, model 1/1 (predictions): Error in model.frame.default(...
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.929 4 0.00474 Preprocessor1_Model1
## 2 roc_auc binary 0.930 4 0.00549 Preprocessor1_Model1
## Warning: Cannot retrieve the data used to build the model (model.frame: object 'wildfire' not found).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.924
## Warning: Cannot retrieve the data used to build the model (model.frame: object 'wildfire' not found).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.736
The first dataset, containing the volume of carbon emitted per country per year (1990-2020), is processed here:
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Country = col_character(),
## `Data source` = col_character(),
## Sector = col_character(),
## Gas = col_character(),
## Unit = col_character(),
## `1990` = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 20 parsing failures.
## row col expected actual file
## 1933 2010 a double N/A 'historical_emissions.csv'
## 1933 2009 a double N/A 'historical_emissions.csv'
## 1933 2008 a double N/A 'historical_emissions.csv'
## 1933 2007 a double N/A 'historical_emissions.csv'
## 1933 2006 a double N/A 'historical_emissions.csv'
## .... .... ........ ...... ..........................
## See problems(...) for more details.
## Warning: NAs introduced by coercion
## Selecting by avgEmission
## Warning in RColorBrewer::brewer.pal(10, "Reds"): n too large, allowed maximum for palette Reds is 9
## Returning the palette you asked for with that many colors
This dataset captures the carbon stock per country for 9 selected years (1990, 2000, 2010, 2015, 2016, 2017, 2018, 2019, 2020).
## Warning: Number of logged events: 31
## New names:
## * `` -> ...12
## * `` -> ...13
## * `` -> ...14
## * `` -> ...15
## * `` -> ...16
## * ...
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Domain = col_character(),
## Area = col_character(),
## Element = col_character(),
## Item = col_character(),
## Year = col_double(),
## Unit = col_character(),
## Value = col_double()
## )
The trend of global carbon stock from 1990 to 2018 is shown below.
For comparison, the trend of emitted GHG from 1990 to 2018 is also shown.
Here we visualize the countries with the largest carbon stock.
## Selecting by avgCarbon
We seek to answer the correlation question here, we start by comparing both variables per year.
## `geom_smooth()` using formula 'y ~ x'
## [1] "Kendall = -0.954415954415954"
## [1] "Spearman = -0.992673992673993"
## [1] "Pearson = -0.871214159804568"
Interpretation of the correlation coefficients: both variables are negatively correlated under every method, so most likely the carbon stock decreases as emissions increase. However, causation cannot be ascertained from correlation alone, because correlation does not necessarily imply causation.
Also, absolute values close to 1.0 indicate an almost perfectly monotonic (Spearman, Kendall) and strongly linear (Pearson) negative relationship between the two variables.
Here, the objective of our analysis is to investigate how much GHG will be absorbed if forest area is increased. To do this, we use the carbon stock and forest area datasets already loaded.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## country = col_character(),
## `1990` = col_double(),
## `2000` = col_double(),
## `2010` = col_double(),
## `2015` = col_double(),
## `2016` = col_double(),
## `2017` = col_double(),
## `2018` = col_double(),
## `2019` = col_double(),
## `2020` = col_double()
## )
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Domain = col_character(),
## Area = col_character(),
## Element = col_character(),
## Item = col_character(),
## Year = col_double(),
## Unit = col_character(),
## Value = col_double()
## )
## Warning: Number of logged events: 2
## `geom_smooth()` using formula 'y ~ x'
The result below was obtained using linear regression:
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 122. 1462. 0.0835 0.936
## 2 totalArea 0.00250 0.000367 6.81 0.000252
Carbon_absorbed = 122 + 0.0025 * totalArea
For a unit increase in total forest area, the volume of carbon absorbed increases by about 0.0025.
If there were no forests, i.e. totalArea = 0, the model would predict a carbon stock of 122, although the intercept is not statistically significant (p = 0.936).
Therefore, a unit increase in forest area results in roughly 0.0025 more units of CO2 being absorbed.
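Plugging the fitted coefficients into the regression equation gives a quick sanity check of this interpretation (coefficients rounded from the tibble above):

```r
# Carbon_absorbed = 122 + 0.0025 * totalArea (rounded estimates).
carbon_absorbed <- function(total_area) 122 + 0.0025 * total_area
carbon_absorbed(1e6) - carbon_absorbed(0)  # 1e6 more area -> ~2500 more absorbed
```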
PPS Score code from ex_6_codealong session